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GAGA: A PARSIMONIOUS AND FLEXIBLE MODEL FOR 
DIFFERENTIAL EXPRESSION ANALYSIS 

By David Rossell 
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Hierarchical models are a powerful tool for high-throughput data 
with a small to moderate number of replicates, as they allow sharing 
information across units of information, for example, genes. We pro- 
pose two such models and show its increased sensitivity in microarray 
differential expression applications. We build on the gamma-gamma 
hierarchical model introduced by Kendziorski et al. [Statist. Med. 22 
(2003) 3899-3914] and Newton et al. [Biostatistics 5 (2004) 155-176], 
by addressing important limitations that may have hampered its per- 
formance and its more widespread use. The models parsimoniously 
describe the expression of thousands of genes with a small number 
of hyper-parameters. This makes them easy to interpret and analyt- 
ically tractable. The first model is a simple extension that improves 
the fit substantially with almost no increase in complexity. We pro- 
pose a second extension that uses a mixture of gamma distributions 
to further improve the fit, at the expense of increased computational 
burden. We derive several approximations that significantly reduce 
the computational cost. We find that our models outperform the orig- 
inal formulation of the model, as well as some other popular methods 
for differential expression analysis. The improved performance is spe- 
cially noticeable for the small sample sizes commonly encountered in 
high-throughput experiments. Our methods are implemented in the 
freely available Bioconductor gaga package. 

1. Introduction. A main challenge in microarray and other high-throughput 
experiments is the limited number of replicated measurements that are ob- 
tained for each gene. That is, data is abundant at an overall level but it is 
scarce at the gene level, and, therefore, there is much potential for meth- 
ods that allow for the sharing of information across genes. This feature is 
particularly important due to the small sample sizes that are frequently 
encountered in these studies. 
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Hierarchical models naturally allow for this kind of information sharing. 
Typical examples are Lonnstedt and Speed (2002) and Smyth (2004), who 
modeled gene-specific parameter estimates via hierarchical empirical Bayes 
methods to obtain improved testing procedures. Kendziorski et al. (2003), 
Newton et al. (2001) and Newton and Kendziorski (2003) proposed hier- 
archical models that depend on few hyper-parameters, that is, they greatly 
reduce the dimensionality of the problem. 

We propose a novel approach for massive multiple inference. From here on 
we focus on the analysis of differential expression in microarrays, although 
the approach can be used for other forms of high-throughput data as well. 
The proposed model builds on the gamma-gamma hierarchical model of 
Kendziorski et al. (2003). This model is methodologically and mathemati- 
cally attractive, but has only had a modest effect in the practice of expression 
analysis. We identify some data analysis issues that might be limiting factors 
to prevent a more widespread application of the gamma-gamma model, and 
propose an improved model to address this issues. Our approach combines 
features from Lo and Gottardo (2007) by specifying varying coefficients of 
variation across genes, and Yuan and Kendziorski (2006) by specifying a 
mixture prior on gene-specific parameters that induces gene clustering. 

In particular, the gamma-gamma model assumes that the observations for 
each gene arise from a gamma distribution with common shape parameter 
across all genes and a scale parameter that arises from a hierarchical gamma 
prior. Since the model uses a single gamma prior, we refer to it as the Ga 
model. We find the gamma choice appealing, for it is a flexible family that 
can capture the asymmetric distributional shapes frequently encountered in 
microarrays, even after log-transforming. As we show in this paper, although 
the Ga model is elegant and parsimonious, it fails to provide an adequate 
fit in a number of examples. The main challenge in adding flexibility to the 
model is that it seriously complicates the computations required for model 
fit and inference. We propose an extension in two directions. First, we specify 
a gamma prior on both the shape and the inverse mean parameters (GaGa 
model). The extension is still parsimonious, requiring only one additional 
hyper-parameter, and it can be fit in a computationally efficient manner. 
We develop an algorithm that requires a computational effort comparable to 
the Ga model. In a second extension we specify a gamma prior on the shape 
parameter and a mixture of gamma priors on the inverse mean (MiGaGa 
model). This provides additional flexibility, albeit at the expense of reduced 
model parsimony and increased computational cost. 

In summary, the aim of this paper is to improve differential expression 
analysis by providing a method with higher sensitivity than several standard 
approaches. This is achieved by extending the basic Ga model while main- 
taining its methodological beauty and closed-form inference. The extension 



GAGA: A PARSIMONIOUS AND FLEXIBLE MODEL 



3 



addresses data analytic issues of practical relevance, and which may have 
prevented the more widespread use of the model. 

The paper is structured as follows. In Section 2 we review the Ga model 
and we present its first extension: the GaGa model. We derive expressions 
for posterior probabilities of interest, and point out several schemes to fit 
the model. In Section 3 we propose as a further generalization the MiGaGa. 
For both extensions, the posterior distributions of the gamma shape param- 
eters are known only up to a constant. We refer to this distribution, which 
to our knowledge has not been described before, as the gamma shape dis- 
tribution. In Section 4 we derive useful approximations for this distribution. 
In Section 5 we outline how to find differentially expressed (DE) genes, and 
in Section 6 we apply our approach to simulated data and several exam- 
ples. Some concluding remarks follow in Section 7. The methods described 
in this paper are implemented in the R package gaga, available as part of 
Bioconductor 2.2. 

2. The GaGa model. We assume that the data has been background 
corrected, normalized and quantified in a sensible manner [Dudoit et al. 
(2002); Stafford (2008)]. Let Xij be the measure of expression for gene i, 
i = 1, ... ,n, in microarray j, j = 1, . . . , J. Let Zj E {1, . . . , K} indicate group 
membership, for example, Zj = 1 for normal cells and Zj = 2 for cancer cells. 
We denote the vector of observations for gene i as Xj and the whole data as x. 
We use Ga(-) to denote a gamma distribution, IGa(-) for the inverse gamma, 
Mult(-) for the multinomial, Dirichlet(-) for the Dirichlet and GaS(-) for 
the gamma shape distribution. The GaS distribution arises as the posterior 
distribution of the gamma shape parameter conditional on the observed 
data and on some other model parameters. To our knowledge it has not 
been introduced before, so we present its definition in Section 4. 

In the differential expression problem the investigator is interested in 
determining the expression pattern that each gene follows. This inference 
problem can be viewed as a hypothesis testing problem. Throughout we use 
the terms hypothesis and expression pattern interchangeably. The simplest 
setup is having K = 2 groups and 2 hypotheses: pattern under which both 
groups are equally expressed (null hypothesis) and pattern 1 under which 
they are differentially expressed (alternative hypothesis). For K > 2 we may 
want to consider more than 2 patterns. For example, if group 1 corresponds 
to normal cells, group 2 to cells with type A cancer and group 3 to type B 
cancer, one may be interested in assigning each gene to one of the following 
patterns: 

Pattern 0: Normal = Cancer A = Cancer B, 
(1) Pattern 1: Normal ^ Cancer A = Cancer B, 

Pattern 2: Normal 7^ Cancer A 7^ Cancer B. 
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Denote by H the number of hypotheses, and let the latent variable Si 6 
{0, 1, . . . ,H — 1} indicate the true expression pattern for gene i. We refer 
to genes with 5. L = as equally expressed (EE) and genes with Si ^ as 
differentially expressed (DE), and we denote S = (S±, . . . , S n ). 

2.1. The model. The Ga model [Kendziorski et al. (2003); Newton et al. 
(2001); Newton and Kendziorski (2003)] assumes that independent 
realizations from Ga(aj,Aj iZj ) (i.e., the mean is cti/\i tZj ). The model as- 
sumes 5i ~ Mult(l,7r), fixes oti = a for all i and specifies the hierarchical 
prior \{ tZj ~ Ga(ao,i / ) for all distinct scale parameters under pattern Si. 
Here (oto, v, a, it) are hyper-parameters common to all genes. For Si = (EE 
genes) we have A^i = • • • = \i t K, and for Si / some of the Xi iZj are different 
from each other, according to the specification of the hypotheses. 

The Ga model imposes the restriction that l/yfal, the within-groups co- 
efficients of variation (CV), must be constant across all genes and groups. 
The assumption is analytically convenient, but we have found it not to 
be reasonable in typical data. Figure 1(a) shows empirical CVs for the 
Armstrong data, described in Section 6. The sample CVs and mean ex- 
pressions are roughly independent. This does not confirm, however, the 
constant CV assumption. If the true CVs were indeed constant, the sam- 
ple CVs should be similar to each other (up to estimation error), whereas 
we observe CVs that range from 0.005 to 0.7. We conducted a simulation 
study and confirmed that, under the constant CV assumption, the range 
of sample CVs should be much smaller. To assess the extent to which 
this lack of fit affects the inference, Figure 1(a) highlights the genes de- 



(a) (b) 
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Mean expression (log-scale) Expression levels (log-scale) 

Fig. 1. Armstrong data, (a) Sample mean and CV for each gene (* denotes genes declared 
DE by the Ga model), (b) Marginal distribution of data (log 2 scale vs. prior predictive of 
GaGa and MiGaGa with M = 2 and uj — uj). 
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clared DE by the Ga model. These are mostly genes with above aver- 
age CV. This is due, we believe, to the constant CV assumption, which 
makes the Ga model view atypical CVs as evidence for differential expres- 
sion. 

In practice an analyst would certainly interrogate the identified genes 
and down- weight cases like those identified in Figure 1(a). Still, it is desir- 
able to allow a.i in the model to depend on the gene % and automate this 
step when possible. The main difficulty with this generalization is that it 
seriously complicates computations. We propose a generalization that ad- 
dresses this limitation in a computationally efficient manner. We introduce 
gene-specific shape parameters a, and assume Xij ~ Ga(aj, ai/\i >Zj ) (i.e., 
Xi tZj is the mean), with the following hierarchical prior: 

Aj /t|(5j, oq, v ~ IGa(ao, ao/u), indep. for i = 1, . . . , n 

(2) 

ai\5i,(3,[i ~ Ga(/3,/9//x), indep. for i = 1, . . . ,n, 

and a prior for 5i as before. We refer to (2) as the GaGa model. As in the Ga 
model, when 5i = we have \n = ■ ■ ■ = XiK, whereas under 5i / some of 
them are different from each other (although they still arise from the same 
marginal distribution). The GaGa model replaces the hyper-parameter a 
of the Ga model with the pair (/?,//). That is, the additional flexibility is 
achieved with only one more hyper-parameter. 

Note that we assume that ai is constant across groups. Allowing it to 
vary across groups would allow to compare not only mean expression lev- 
els but the full distribution, which is biologically relevant (Lapointe et al. 
(2004); Coombes, Wang and Baggerly (2007)). Even though in this paper 
we focus on the comparison of Xi, z . , we have also derived and implemented 
the comparison of aj in our Bioconductor package gaga. 

2.2. Posterior distributions. We derive the posterior probability that a 
gene follows each expression pattern, which is needed to obtain lists of dif- 
ferentially expressed genes. We also present the posterior distribution of the 
first-stage parameters, which are needed to obtain fold change estimates. 
Both posterior probabilities and distributions are derived assuming that the 
hyper-parameters u = (ao, z/, (3, /i, 7r) are fixed, as was done for the Ga model. 
We denote the vector of means for a single gene as Aj = (Aj,i, . . . , Xi,k) and 
we let A = (Ai, . . . , A n ), a = (a±, . . . , a n ) be the collection of these param- 
eters. From (2) we see that, conditional on u, the gene-specific parameters 
(5i,ai, Ai) are independent a posteriori across genes i = 1, . . . , n. Therefore, 
it suffices to derive the posterior for each gene separately. 

Let Pi be the log-product of the observations for gene i, that is, Pi = 

iogn = 

i x^, and let Ns i be the number of groups that are distinct under 
pattern <5j. In our example in (1) we have H = 3 patterns: under pattern 
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we have Nq = 1 distinct groups, and, similarly, N\ = 2, N2 = 3. Let Si^k f° r 
i = 1, . . . , n, Si = 0, . . . , H — 1 and k = 1, . . . , iV^ be the sum of observations 
from gene i that under pattern <5j correspond to the fcth distinct group and 
JiSik be the number of terms in the sum. In our example Sio,o,i denotes the 
sum of all observations from gene 10 (since under pattern there is only one 
distinct group), Sio,i,i denotes the sum of observations from normal samples 
(since it is the first distinct group under pattern 1) and 610,1,2 the sum from 
cancers of type A and B. We denote by and 3^ the collection of Si t s it k 
and J i)Sijk for k = 1, ...,N Si . 

The posterior probability that gene i follows expression pattern I, which 
we denote as vu, is given by vu = P(5i = £|x, u)) oc /(xj|<5j = 1,u)tti for I = 
0, .. .,H — 1, where /(xj|Jj,u;) is equal to 



(3) 



{a /v) a °(J3/n)f l l N 'i 1 % 



r(a )T(J3) 



n C(J iA ,/3,/3///-P»,a ,ao/i/, S iA ) : 



and C(-) is the gamma shape normalization constant, defined in Section 4. 

The distribution of (o^, Aj&) conditional on the observed data, 5, and the 
hyper-parameters w is 

Oi|(5i,u;,x ~ GaS(J» a;, /3, /3//i - Pj, a , ao/z^, S ij(5i ), 

(4) 

Ai,fc|aj,<5i,w,x~ IGa(QjJ ij( 5 iifc + a ,ao/^ + o-iSi^k)- 

For any given a;, (4) can be used to obtain posterior credibility intervals in 
the usual fashion. Note that a, and Aj k are n °t conditionally independent 
a posteriori given as they are a priori. 

2.3. Model fitting. The model can be fit by estimating {olq,v, (3, /j,,tt) 
with an empirical Bayes argument. To this end, we implemented an expectation- 
maximization algorithm completely analogous to that for the Ga model 
[Kendziorski et al. (2003), Appendix]. The EM algorithm is described in 
the Supplementary Material [Rossell (2009)]. Alternatively, fully Bayesian 
model fitting schemes which specify a hyper-prior on (ao, v, (3, f-t, 7r) are also 
possible and are provided in the Supplementary Material [Rossell (2009)]. 
Both EM and fully Bayesian approaches are implemented in our gaga pack- 
age, and they usually deliver virtually identical results. This is to be ex- 
pected, as microarray data is strongly informative about parameters that 
are common to all genes. 

Both the empirical and fully Bayesian algorithms require the evaluation of 
the normalization constants in the posterior distribution of the gamma shape 
parameters a± , . . . , ct n , which are not available in closed form. In Section 4 
we derive useful approximations that reduce the computational burden. 
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3. The MiGaGa model. The GaGa model addresses the problem illus- 
trated in Figure 1(a) by allowing varying CVs across genes. However, an- 
other limitation remains. In practice, some normalization procedures, such 
as RMA [Irizarry et al. (2003)] or GCRMA [Wu et al. (2004)], oftentimes 
result in a distinctly bimodal distribution for the gene expressions. Figure 
1(b) shows a kernel density estimate of the density of Xij for the Armstrong 
data (see Section 6.1), and compares it with the prior-predictive under the 
GaGa model. The model does not capture the bimodality. To address this 
limitation, we introduce a further generalization, by letting Aj^ arise from 
the mixture 

M 

\,k\5i,p,ai Q ,v ~ ^2 PmJGa(a 0m ,a 0m /u m ), 

m=0 

p ~ Dirichlet(r), 

where M is the number of components in the mixture. The rest of the 
model is as in (2). The posterior distributions and model fitting procedures 
are largely analogous to that for the GaGa model and are detailed in the 
Supplementary Material [Rossell (2009)]. The main difference is that for 
the MiGaGa one introduces latent variables indicating the cluster to which 
each gene belongs. 

Compared to the GaGa prior, the additional flexibility in MiGaGa poten- 
tially allows us to obtain a better fit to the data, albeit this comes at the cost 
of increased model complexity and computational burden. Figure 1(b) shows 
how the MiGaGa prior predictive improves the GaGa fit substantially. We 
selected M = 2 clusters as they capture the bi-modality observed in Figure 
1(b). In general, one can either select M maximizing some criterion such 
as the BIC [Schwarz (1978)] or simply fit a MiGaGa model with large M. 
In the latter case, after the model fit one could remove the clusters with 
estimated weights p m below some threshold. 



4. The gamma shape distribution. The posterior distribution of the shape 
parameter a% in (2), which we refer to as gamma shape distribution, has not 
been described before. It is similar to the distribution that arises when the 
parametrization is in terms of the shape and scale parameters [Damsleth 
(1975); Miller (1980)]. To simplify notation, we denote by y a positive con- 
tinuous random variable that follows this distribution. Its probability den- 
sity function, indexed by the parameters a = (a±, . . . , a p ) where a, > 0, b > 0, 
d > 0, r > 0, s = (si, . . . , s p ) where Sj > 0, c > — Y%=i a i l°g( s i/ a i)> can be 
written as /(y|a, 6, c, d, r,s) = 

(o) CWA v, s| ^, e -,^m-"' 

, = i T(y) 1 \r + Siy 
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for y > 0. C(a, b, c, d, r, s) is the normalization constant and T(-) is the 
gamma function. For a± = ■ ■ ■ = a p = d = 0, (6) simplifies to a gamma distri- 
bution. 

In general, to obtain random draws from (6) or to evaluate C(a, b, c, d, r, s), 
one has to resort to numerical methods. This is impractical in our setup, 
since both the EM algorithm and the fully Bayesian fitting schemes [see 
Supplementary Material in Rossell (2009)] require performing these steps 
a very large number of times. Approximations are required to decrease the 
computational burden. 

We start by deriving an approximation to (6) that is appropriate for large 
values of y. By approximating T(-) with Stirling's formula and evaluating the 
limit of the resulting expression as y — > oo and aiy + d — > oo, we find that (6) 

is roughly proportional to a Ga(6 + 0.5^^=1 ( a « ~~ 1)> c + J2i=i a i l°g( s i/ a i))- 
This gives a straightforward manner to obtain approximate draws from (6). 
To approximate C(a, b, c, d, r, s), one could simply use the normalization con- 
stant of the approximating gamma distribution. Instead, we use an alterna- 
tive approach that in practice improves the quality of the approximation. 
Denote as g(y) the probability density function of the gamma approxima- 
tion, and let m be its mode. Evaluating g and (6) at m gives 

C(a, b, c, d, r, s) 



In the Supplementary Material [Rossell (2009)] we show examples com- 
paring the gamma shape distribution and its gamma approximation. In our 
examples the approximation error is below 10 -5 for the density and 10~ 14 
for C(a, b, c, d, r, s) . In the microarray data that we have analyzed so far 
the approximation worked well, as most coefficients of variation are <C 1 
and, therefore, the posterior distribution is centered around large y val- 
ues. Also, ai > 1 as it is the sample size in group i and d > 0, and, hence, 
cny + d is also large. In some rare cases we detected that the mode of the 
approximation did not match that of (6) (indicated by the first derivative of 
l°g[/(y| a i b, c, d, r, s)] not being close to zero). In these cases we used a few 
Newton-Raphson steps to locate the mode and used the gamma approxi- 
mation that matches the location of the mode as well as the value of the 
second derivative of log[/(j/|a, b, c, d, r, s)] evaluated at the mode. 

5. Inference. We formalize inference for differential expression by mini- 
mizing the Bayesian false negative rate (BFNR) subject to an upper bound 
on the Bayesian false discovery rate (BFDR) [Miiller, Parmigiani and Rice 
(2007)]. Briefly, BFNR is the posterior expected false negative rate (i.e., 



(7) 
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genes declared EE that are actually DE), and BFDR is the posterior ex- 
pected false positive rate (i.e., genes declared DE that are actually EE). 
The definition is analogous to the frequentist FDR and FNR definitions, 
and it remains valid for more than two hypotheses. The Bayes rule is to 
declare a gene as DE whenever its posterior probability of DE is above 
a certain threshold. Miiller, Parmigiani and Rice (2007) provided a simple 
expression to determine the threshold. The result extends trivially to our 
multiple hypotheses setup with a slight adjustment: given that a gene is not 
classified into pattern 0, we propose assigning it to the pattern with the high- 
est posterior probability. That is, for given BFDR and BFNR we maximize 
the number of genes correctly classified into their expression pattern. 

Since the posterior probabilities in Section 2.2 are derived under an as- 
sumed probability model, deviations from the model assumptions may result 
in poor performance of the procedure. In Section 6.2 we assess its frequentist 
operating characteristics by bootstrapping one example. For further details 
see the supplementary material [Rossell (2009)]. 

6. Results. We assess the performance of the GaGa and MiGaGa models 
in simulated data and several examples. We fit MiGaGa with M = 2 clusters, 
as we believe it offers a reasonable compromise between model flexibility 
and computational speed. In Section 6.1 we analyze the leukemia data of 
Armstrong et al. (2002), and in Section 6.2 we use this data to conduct 
several simulation studies. We also analyze the Affymetrix data from the 
MAQC study [MAQCconsortium (2006)]. This study is a valuable resource, 
as the differential expression status of over 1,000 genes was validated via 
quantitative PCR Taqman assays. The models were fit via an EM scheme, 
as described in Section 2.3. 

We compare our methodology to the Ga model, BRIDGE [Gottardo et al. 
(2006)], limma with Benjamini-Hochberg p- value adjustment (limma-BH) 
[Smyth (2005); Benjamini and Hochberg (1995)], the Significance Analy- 
sis of Microarrays (SAM) of Tusher, Tibshirani and Chu (2001), and a t- 
test/-F-test with beta-uniform mixture p- value adjustment (t-BUM) 
[Pounds and Morris (2003)]. For limma-BH, SAM and t-BUM we use log2- 
transformed data BRIDGE to automatically log-transform the data, so we 
gave un-logged data as input to the routine. For Ga, GaGa and MiGaGa 
we use the original scale, as these methods have been designed to work with 
positive real values. All methods are used as implemented in their respective 
Bioconductor packages [Gentleman et al. (2004)]. Ga is one of the methods 
implemented in the package EBarrays [Kendziorski, Newton and Sarkar 
(2005)]. We do not perform a comparison with the lnnmv procedure within 
EBarrays, which formulates a log- normal model that allows genes to have 
different variances in a manner similar to limma. BRIDGE is implemented 
in bridge [Gottardo (2004)], limma-BH in limma [Smyth (2005)], SAM in 
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siggenes [Schwender (2007)], and i-BUM in ClassComparison [Coombes 
(2007)]. 

All methods were set up to control the FDR below 0.05. 

6.1. Armstrong data. The data (http://www.broad.mit.edu/cgi-bin/ 
cancer/publications/pub_paper.cgi?mode=view&paper_id=63) consists of 24 
Affymetrix U95A arrays from acute lymphoblastic leukemia (ALL) sam- 
ples, 18 U95A arrays from lymphoblastic leukemia with MLL translocations 
(MLL), and 2 U95Av2 arrays also from the MLL group. The U95Av2 arrays 
were obtained at a later date than the rest, possibly under different exper- 
imental conditions, so we excluded them from the analysis. The data also 
contained samples with acute myelogenous leukemia, but for illustration we 
restrict attention to the ALL and MLL groups. 

The data was background corrected, normalized and summarized us- 
ing the function just.gcrma from the R package gcrma [Wu and Irizarry 
(2007)]. 

6.1.1. Model fit. Figure 1(a) reveals a violation of the constant CV as- 
sumption of the Ga model, and that the model tends to flag genes with large 
CVs as DE. Figure 1(b) shows that a MiGaGa fit with M = 2 components 
describes the data better than a single-component GaGa. Further assess- 
ment of goodness of fit can be found in Supplementary Material [Rossell 
(2009)]. 

6.1.2. Differential expression analysis. To study the behavior of the meth- 
ods under small sample sizes and evaluate the reliability of the results, we 
analyze multiple random subsets of chips and report averaged results. We 
start by fitting the model to 5 randomly chosen arrays from each group. We 
then add 5 more arrays per group, then 10 and finally we analyze the full 
data set. We repeat this process 20 times. 

Table 1 shows the average number of genes declared DE when analyzing 
a subset of 5, 10 and 15 arrays per group, as well as the full data. The table 
also provides the percentage of reproducibility, that is, how many among 
those genes were found again when analyzing the full data set. For instance, 
with 5 arrays per group MiGaGa found 61.5 genes on the average, 86.0% of 
which were confirmed in the full data. Ga was the method declaring the most 
genes as DE. The observed lack of fit of the Ga model and the simulation 
study conducted in Section 6.2 suggest that Ga produces lists of DE genes 
with an FDR well above the desired 5%. For most sample sizes, GaGa and 
MiGaGa found more genes than the remaining competitors, and presented 
reasonably high reproducibility. The advantage is especially noticeable for 
small sample sizes, where GaGa and MiGaGa find at least twice as many 
genes as the competitors. 
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Table 1 

Gene discoveries in the Armstrong data. # DE: average number of genes declared DE; % 
rep. : average percentage of # DE also found when analyzing the full data 





5 


arrays 


10 


arrays 


15 


arrays 


All data 




#DE 


% rep. 


#DE 


% rep. 


#DE 


% rep. 


#DE 


GaGa 


58.5 


0.856 


431.0 


0.893 


784.0 


0.889 


991 


MiGaGa (M = 2) 


61.5 


0.860 


445.0 


0.893 


815.0 


0.890 


1040 


Ga 


900.0 


0.810 


1261.0 


0.885 


1526.0 


0.918 


1744 


BRIDGE 


21.5 


0.944 


182.5 


0.960 


439.0 


0.959 


716 


limma-BH 


21.5 


0.947 


181.5 


0.957 


543.0 


0.946 


972 


SAM 


0.0 


0.937 


274.0 


0.973 


804.0 


0.956 


1477 


t-BUM 


7.5 


1.000 


168.5 


0.975 


586.0 


0.965 


1194 



6.2. Simulation study. We conduct a parametric and a nonparametric 
simulation study. For both we use the Armstrong data, so that the simula- 
tions are as realistic as possible. We simulate 200 data sets, conduct analyses 
analogous to those in Section 6.1 and compute average power, FDR, Receiver 
Operating Characteristics (ROC) curves and areas under the curve (AUC). 

For the parametric simulation, we generate data from the posterior pre- 
dictive distribution of the GaGa model fit to the Armstrong data. That is, 
gene expressions are gamma distributed and, for each gene, the simulated 
means and CVs are consistent with the sample means and CVs. The genes 
DE status are determined based on the DE posterior probabilities. 

For the nonparametric simulation, we also determine which genes are dif- 
ferentially expressed using the GaGa posterior probabilities from Armstrong 
data. Expression values for EE genes are generated by re-sampling the Arm- 
strong data arrays with replacement, regardless of which group they came 
from. This conserves both the marginal distribution for each gene and also 
the correlations between EE genes. For DE genes, we again sample arrays 
with replacement, but the ALL data is simulated by sampling from the ALL 
group only, and similarly for MLL data. We repeat this process 200 times 
and report average power and FDR. 

6.2.1. Differential expression. Table 2 shows the observed FDR and power 
for several sample sizes. As suggested by the lack of fit discussed above, Ga 
presents FDR rates well above the desired 5%. In the parametric simula- 
tions, GaGa and MiGaGa appropriately control the FDR below 5% and 
they present a higher average power than the remaining competitors for all 
sample sizes. In the nonparametric simulations, the advantage in power is 
more noticeable, that is, from 0.449 for limma-BH to 0.671 for MiGaGa. 
However, the FDR was slightly above 5% in several scenarios. Among the 
competitors, limma-BH and SAM performed best. 
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6.2.2. ROC curves. We computed the average ROC curve over the 200 
nonparametric simulations with 20 arrays per group. Figure 2(a) shows the 
average FDR and power. GaGa and MiGaGa presented very similar curves 
and dominated uniformly the competing methods. limma-BH, SAM and t- 
BUM performed better than Ga and BRIDGE when the FDR was below 
0.2, that is, the range typically considered in practice. The GaGa AUC was 
significantly lower than the MiGaGa AUC, and significantly higher than the 
other methods' AUC (Wilcoxon paired test, Bonferroni corrected P < 0.01). 
However, the numerical difference between the GaGa and MiGaGa AUCs is 
of no practical relevance. 

6.3. MAQC study. The MicroArray Quality Control (MAQC) project 
was initiated to assess the reliability and reproducibility of findings obtained 
from microarray experiments. Expression data was obtained for four titra- 
tion pools (A, B, C and D) generated from two distinct reference RNA sam- 
ples, at multiple sites and using several technology platforms. The two RNA 
samples types were Universal Human Reference RNA (UHRR) from Strata- 
gene and a Human Brain Reference RNA (HBRR) from Ambion. The four 
pools included the two reference RNA samples and two mixtures: Sample A, 
100% UHRR; Sample B, 100% HBRR; Sample C, 75% UHRR:25% HBRR; 
and Sample D, 25% UHRR:75% HBRR. Confirmatory qPCR assays were 

Table 2 

Average FDR and power for different sample sizes. Data simulated from GaGa posterior 

predictive 

5 arrays 10 arrays 15 arrays 20 arrays 

FDR Power FDR Power FDR Power FDR Power 



Parametric simulation 



GaGa 







Oil 





.066 


0.000 


0.322 


0. 


.007 


0. 


,512 





,002 





,608 


MiGaGa (M 


= 2) 





.011 





.066 


0.000 


0.328 





,008 





,520 





,002 





.615 


Ga 







.159 





.434 


0.133 


0.587 


0. 


,117 





667 





,105 





.712 


BRIDGE 







.000 





.034 


0.016 


0.232 


0. 


032 


0. 


424 





,022 





.553 


limma-BH 







.012 





.063 


0.035 


0.288 





034 


0. 


,487 





,036 





.580 


SAM 







.000 





.000 


0.044 


0.272 





043 





,492 





,042 





.582 


t-BUM 







.065 





.021 


0.040 


0.266 





,034 





,480 





,042 





.577 












Nonparametric 


simulation 
















GaGa 







.043 





.054 


0.066 


0.319 


0. 


067 


0. 


,529 





,065 





.660 


MiGaGa (M 


= 2) 





.047 





.057 


0.066 


0.327 





,068 





,541 





,068 





.671 


Ga 







.342 





.397 


0.289 


0.567 


0. 


,254 


0. 


666 





,239 





.740 


BRIDGE 







.099 





.048 


0.045 


0.133 





035 





,240 





,041 





.339 


limma-BH 







.047 





.029 


0.021 


0.168 





019 





321 





,024 





.449 


SAM 







.049 





.020 


0.050 


0.197 





,051 





360 





,048 





.481 


t-BUM 







.070 





.022 


0.043 


0.156 





,053 





,324 





,048 





.454 
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(a) (b) 




Fig. 2. ROC curves, (a) Averaged over Armstrong bootstrap simulations. AUC: 
GaGa= 0.781, MiGaGa — 0.782, Ga = 0.707, BRIDGE = 0.595, limma-BH = 0.631, 
SAM= 0.620, t-BUM= 684. (b) MAQC data. AUC: GaGa= 0.0585, MiGaGa = 0.0581, 
Ga= 0.0570, BRIDGE^ 0.0567, limma-BH= 0.0596, SAM= 0.0580, t-BUM= 0.0599. 



performed for the 1,296 genes with the largest i-test statistic values com- 
paring groups A vs. B. qPCR assays have a much higher sensitivity than 
microarrays, and are therefore a powerful tool to assess the DE status of a 
selected list of genes. 

We restrict attention to the 20 Affymetrix hgul33plus2 arrays obtained 
in the first site, and we analyze the microarray data as pre-processed in 
MAQCconsortium (2006). We consider that qPCR confirms that a gene is 
DE whenever the limma .F-test unadjusted p-value for the qPCR data is 
less than 0.05. 

6.3.1. Differential expression. When fitting the GaGa and MiGaGa (M = 
2) models we reflect the experimental setup considering the following 5 hy- 
potheses for each gene: 

Pattern 0: A = C = D = B, 

Pattern 1: A / C = D = B, 

(8) Pattern 2: A = C/D = B, 

Pattern 3: A = C = D / B, 

Pattern 4: A/C/D/B. 

As groups C and D contain both UHRR and HBRR, one expects a priori 
that most genes follow either Pattern or Pattern 4. Patterns 1-3 include 
DE genes for which not all titrations are different. For instance, Pattern 
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1 contains genes for which 25% HBRR is enough to modify its expression 
levels as much as 100% HBRR. 

GaGa assigns 20272, 0, 1429, 3935 and 29039 genes to patterns 0, 1, 2, 
3 and 4 (respectively), whereas MiGaGa assigns 16328, 0, 1323, 3697 and 
33327. That is, most of the differentially expressed genes are assigned to 
Pattern 4, as expected. We assessed the biological interpretation of these 
gene lists by conducting a gene ontology enrichment analysis. To this end, 
we created (i) a list with the 1000 genes with the highest posterior proba- 
bility for Pattern 4 and that presented group B mean > group A mean, and 
(ii) another list with 1000 with highest posterior probabilities for Pattern 4 
and group A mean > group B mean. After obtaining these two lists both 
for GaGa and MiGaGa, we tested for enriched biological process categories 
using the DAVID software [Dennis et al. (2003)]. The highest enrichment 
in genes over-expressed in group B with respect to group A was observed in 
transmission of nerve impulse, synaptic transmission, nervous system devel- 
opment, cell-cell signaling and cell communication. All these functions are 
needed specifically in the brain, and, hence, they should indeed be enriched 
in HBRR. The highest enrichment in genes under-expressed in group B was 
observed for cell cycle, cell cycle phase, mitotic cell cycle, cell cycle pro- 
cess and M phase. These functions are all related to cell proliferation, and 
are expected to be turned off in differentiated organs like the brain. The 
complete listings of the significantly enriched categories were also highly 
consistent with the experiment's biological background, and are provided in 
the Supplementary Material [Rossell (2009)]. 

6.3.2. ROC curves. We compare our approach to the competing meth- 
ods by computing ROC curves, using only qPCR- validated genes. It should 
be noted that, as only genes with large t-test statistic values were selected 
for qPCR validation, this puts the t-test based methods (limma-BH, SAM, 
t-BUM) at an advantage with respect to Ga, GaGa, MiGaGa and BRIDGE. 
That is, a number of the potentially interesting genes found by the latter 
methods were never validated due to having small to moderate t-test statis- 
tic values. 

For Ga, we specify the same 5 expression patterns considered for GaGa 
and MiGaGa. For the remaining methods, we simply test the null hypothesis 
(Pattern 0) against the full alternative (Pattern 4). As the Bioconductor 
bridge package currently does not support more than 3 groups, for BRIDGE 
we only analyzed data from groups A and B. 

Figure 2(b) displays the results. The x-axis provides proportion of non- 
validated qPCR genes (i.e., with limma-BH p- value >0.05 for qPCR data), 
and the y-axis the proportion of validated qPCR genes (i.e., p- value >0.05). 
Among the 1296 genes selected for qPCR validation in the original MAQC 
paper, only 7.8% were not confirmed. Therefore, for all considered analysis 
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methods the proportion of nonvalidated genes (x-axis) can at the most be 
7.8%. 

In contrast with Figure 2(a), the differences between methods are not 
large. The t-test based methods present slightly higher AUC than GaGa and 
MiGaGa, which in turn present slightly higher AUC than Ga and BRIDGE. 
Qualitatively, all methods achieve good levels of qPCR validation. 

7. Discussion. We introduced two hierarchical models for high-throughput 
data based on the gamma distribution. This flexible parametric choice al- 
lows to capture the asymmetric data frequently encountered in this field, 
even after log-transformations. GaGa builds on the Ga model by relaxing 
the constant coefficient of variation assumption. This results in a parsimo- 
nious model with a substantial improvement in the model fit and therefore 
in reliability of the resulting inference. The increased generality comes at 
a negligible computational cost. We derived an approximation for the pos- 
terior distribution of the gamma shape parameter that further simplifies 
computation. The second extension, the MiGaGa, increases the model flexi- 
bility by incorporating a mixture prior, at the expense of model parsimony. 
In practice, a mixture with as few as two components may suffice to provide 
a satisfactory fit. We have shown that, in many situations, GaGa achieves 
almost the same degree of performance compared to MiGaGa, and hence 
that it is a sensible default choice. 

The hierarchical nature of the models allows for the sharing of information 
across genes. This is specially beneficial for the small sample sizes often 
encountered in high-throughput experiments. 

We compared our models with two other gamma-based models and three 
t-test and normal linear model based approaches. In simulations and in sev- 
eral examples we have shown how both GaGa and MiGaGa find more genes 
than the competing methods, while controlling the FDR around the desired 
levels. For instance, when analyzing a subset with 5 arrays per group from 
the Armstrong data we detect around 60 differentially expressed genes, while 
the best competing methods found around 20. The fact that around 86% of 
these genes were found again when analyzing the full data gives us confidence 
that these are not spurious findings. Both in parametric and nonpar ametric 
simulations we have seen that GaGa and MiGaGa present improved operat- 
ing characteristics. ROC curves showed potential for substantial increases in 
power at fixed FDR levels. Even in a list of qPCR-validated genes selected 
for having a large t-test statistic, our models performed almost as well as 
the t-test based methods, and delivered biologically meaningful results. 

Some extensions of the model are possible. For instance, the interest might 
be not only in seeking differences in mean expression but also in distribu- 
tional shape. This is frequently of biological interest, since many mutations, 
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deletions and translocations affect only a proportion of the diseased individ- 
uals, and, hence, one expects to see differences in the tail behavior between 
groups. Although not presented in this paper, we derived such an approach 
and implemented it as an option in the Bioconductor gaga library. 

Also, we have not explicitly modeled the dependence between genes. Not 
learning about the dependence structure limits the use of the model in find- 
ing gene networks or gene interactions. Interesting future work will be to 
include dependence. Other possibilities are extending the model to include 
covariate information and study-specific random effects, which would make 
it appealing for meta-analysis purposes, or using the model for sample size 
calculation as in Miiller et al. (2004) or sequential sample size calculation. 
In the latter application, the computational efficiency of the GaGa model 
should prove a major asset. 

Acknowledgments. We thank Peter Miiller and Jens Luders for their 
very useful comments. 

SUPPLEMENTARY MATERIAL 

Supplement to GaGa: A parsimonious and flexible model for differential 
expression analysis (DOI: 10.1214/09-AOAS244SUPP; .pdf). We detail an 
EM algorithm and two fully Bayesian MCMC schemes for model fitting, 
and a Bayesian procedure for FDR control. We also assess model goodness- 
of-fit, assess the quality of the gamma approximation to the gamma shape 
distribution and detail the gene ontology analysis performed for the MAQC 
study. 
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